Covid-19 Report¶

In this project we work through the Covid-19 data from Johns Hopkins University to build a full world status report. The project is divided into four parts:

  • Setting up the data
  • Exploratory data analysis
  • Generating the report
  • Conclusion

Setting up the data¶

We start our project by loading the needed packages and the Johns Hopkins University data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from fpdf import FPDF
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime, timedelta
import pycountry

confirmed_link = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
death_link = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
recovered_link = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'

confirmed_df = pd.read_csv(confirmed_link)
confirmed_df.tail()
Out[1]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
284 NaN West Bank and Gaza 31.952200 35.233200 0 0 0 0 0 0 ... 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228
285 NaN Winter Olympics 2022 39.904200 116.407400 0 0 0 0 0 0 ... 535 535 535 535 535 535 535 535 535 535
286 NaN Yemen 15.552727 48.516388 0 0 0 0 0 0 ... 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945
287 NaN Zambia -13.133897 27.849332 0 0 0 0 0 0 ... 343012 343012 343079 343079 343079 343135 343135 343135 343135 343135
288 NaN Zimbabwe -19.015438 29.154857 0 0 0 0 0 0 ... 263921 264127 264127 264127 264127 264127 264127 264127 264276 264276

5 rows × 1147 columns

In [2]:
death_df = pd.read_csv(death_link)
death_df.tail()
Out[2]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
284 NaN West Bank and Gaza 31.952200 35.233200 0 0 0 0 0 0 ... 5708 5708 5708 5708 5708 5708 5708 5708 5708 5708
285 NaN Winter Olympics 2022 39.904200 116.407400 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
286 NaN Yemen 15.552727 48.516388 0 0 0 0 0 0 ... 2159 2159 2159 2159 2159 2159 2159 2159 2159 2159
287 NaN Zambia -13.133897 27.849332 0 0 0 0 0 0 ... 4057 4057 4057 4057 4057 4057 4057 4057 4057 4057
288 NaN Zimbabwe -19.015438 29.154857 0 0 0 0 0 0 ... 5663 5668 5668 5668 5668 5668 5668 5668 5671 5671

5 rows × 1147 columns

In [3]:
recovered_df = pd.read_csv(recovered_link)
recovered_df.tail()
Out[3]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
269 NaN West Bank and Gaza 31.952200 35.233200 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
270 NaN Winter Olympics 2022 39.904200 116.407400 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
271 NaN Yemen 15.552727 48.516388 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
272 NaN Zambia -13.133897 27.849332 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
273 NaN Zimbabwe -19.015438 29.154857 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 1147 columns

We now have three dataframes containing the confirmed, death, and recovered cases. The data is in wide format: one row per country (or province) and one column per date. The confirmed and death dataframes have 289 rows, while the recovered dataframe has 274, which means we don't have recovered-case information for every region.

Let us reshape the data into a more suitable, long format, with one column for the dates and one column for the cases.
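On a toy frame with made-up numbers, this wide-to-long reshape looks like:

```python
import pandas as pd

# A toy wide-format frame: one row per country, one column per date.
wide = pd.DataFrame({
    'Country/Region': ['Yemen'],
    '1/22/20': [0],
    '1/23/20': [3],
})

# Unpivot the date columns into a Date column and a Cases column.
long = pd.melt(wide, id_vars=['Country/Region'],
               var_name='Date', value_name='Cases')
print(long)
```

Each date column becomes one row per country, which is exactly what we do with the real data below.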

In [4]:
confirmed_df = pd.melt(confirmed_df, id_vars=confirmed_df.columns[0:4], value_vars=confirmed_df.columns[4:],
        var_name='Date', value_name='Cases')
confirmed_df.tail()
Out[4]:
Province/State Country/Region Lat Long Date Cases
330322 NaN West Bank and Gaza 31.952200 35.233200 3/9/23 703228
330323 NaN Winter Olympics 2022 39.904200 116.407400 3/9/23 535
330324 NaN Yemen 15.552727 48.516388 3/9/23 11945
330325 NaN Zambia -13.133897 27.849332 3/9/23 343135
330326 NaN Zimbabwe -19.015438 29.154857 3/9/23 264276
In [5]:
death_df = pd.melt(death_df, id_vars=death_df.columns[0:4], value_vars=death_df.columns[4:],
        var_name='Date', value_name='Cases')
death_df.tail()
Out[5]:
Province/State Country/Region Lat Long Date Cases
330322 NaN West Bank and Gaza 31.952200 35.233200 3/9/23 5708
330323 NaN Winter Olympics 2022 39.904200 116.407400 3/9/23 0
330324 NaN Yemen 15.552727 48.516388 3/9/23 2159
330325 NaN Zambia -13.133897 27.849332 3/9/23 4057
330326 NaN Zimbabwe -19.015438 29.154857 3/9/23 5671
In [6]:
recovered_df = pd.melt(recovered_df, id_vars=recovered_df.columns[0:4], value_vars=recovered_df.columns[4:],
        var_name='Date', value_name='Cases')
recovered_df.tail()
Out[6]:
Province/State Country/Region Lat Long Date Cases
313177 NaN West Bank and Gaza 31.952200 35.233200 3/9/23 0
313178 NaN Winter Olympics 2022 39.904200 116.407400 3/9/23 0
313179 NaN Yemen 15.552727 48.516388 3/9/23 0
313180 NaN Zambia -13.133897 27.849332 3/9/23 0
313181 NaN Zimbabwe -19.015438 29.154857 3/9/23 0

Now let us check for any missing values.

In [7]:
confirmed_df.isna().sum()
Out[7]:
Province/State    226314
Country/Region         0
Lat                 2286
Long                2286
Date                   0
Cases                  0
dtype: int64
In [8]:
death_df.isna().sum()
Out[8]:
Province/State    226314
Country/Region         0
Lat                 2286
Long                2286
Date                   0
Cases                  0
dtype: int64
In [9]:
recovered_df.isna().sum()
Out[9]:
Province/State    227457
Country/Region         0
Lat                 1143
Long                1143
Date                   0
Cases                  0
dtype: int64

Most of the Province/State values are missing, so we are going to drop this column and aggregate the Cases by the Country/Region column. We are also dropping the Lat and Long columns, since we will be using the pycountry package for the map visualizations.

In [10]:
confirmed_df = confirmed_df.iloc[:, [1, 4, 5]]
death_df = death_df.iloc[:, [1, 4, 5]]
recovered_df = recovered_df.iloc[:, [1, 4, 5]]
confirmed_df['Cases'] = confirmed_df.groupby(['Country/Region', 'Date'])['Cases'].transform('sum')
confirmed_df = confirmed_df.drop_duplicates(subset=['Country/Region', 'Date', 'Cases']).reset_index(drop=True)
confirmed_df.tail()
Out[10]:
Country/Region Date Cases
229738 West Bank and Gaza 3/9/23 703228
229739 Winter Olympics 2022 3/9/23 535
229740 Yemen 3/9/23 11945
229741 Zambia 3/9/23 343135
229742 Zimbabwe 3/9/23 264276
In [11]:
death_df['Cases'] = death_df.groupby(['Country/Region', 'Date'])['Cases'].transform('sum')
death_df = death_df.drop_duplicates(subset=['Country/Region', 'Date', 'Cases']).reset_index(drop=True)
death_df.tail()
Out[11]:
Country/Region Date Cases
229738 West Bank and Gaza 3/9/23 5708
229739 Winter Olympics 2022 3/9/23 0
229740 Yemen 3/9/23 2159
229741 Zambia 3/9/23 4057
229742 Zimbabwe 3/9/23 5671
In [12]:
recovered_df['Cases'] = recovered_df.groupby(['Country/Region', 'Date'])['Cases'].transform('sum')
recovered_df = recovered_df.drop_duplicates(subset=['Country/Region', 'Date', 'Cases']).reset_index(drop=True)
recovered_df.tail()
Out[12]:
Country/Region Date Cases
229738 West Bank and Gaza 3/9/23 0
229739 Winter Olympics 2022 3/9/23 0
229740 Yemen 3/9/23 0
229741 Zambia 3/9/23 0
229742 Zimbabwe 3/9/23 0

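As an aside, the transform-plus-drop_duplicates pattern used above can also be written as a single groupby aggregation; a minimal sketch with hypothetical province rows:

```python
import pandas as pd

# Two provinces of the same country on the same date, plus a single-row country.
df = pd.DataFrame({
    'Country/Region': ['Australia', 'Australia', 'Zimbabwe'],
    'Date': ['3/9/23', '3/9/23', '3/9/23'],
    'Cases': [100, 50, 264276],
})

# One row per country and date, with province cases summed.
country_df = df.groupby(['Country/Region', 'Date'], as_index=False)['Cases'].sum()
print(country_df)
```

Both approaches yield the same one-row-per-country-per-date result.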
Now all three dataframes have the same number of rows (229743). In each dataframe, the Cases column holds the accumulated number of cases for a specific country. Let us create a column for the daily new cases. To do so, we need to order the rows by the Date field, so we first convert it to datetime.

In [13]:
confirmed_df['Date'] = pd.to_datetime(confirmed_df['Date'])
death_df['Date'] = pd.to_datetime(death_df['Date'])
recovered_df['Date'] = pd.to_datetime(recovered_df['Date'])
confirmed_df = confirmed_df.sort_values(['Country/Region', 'Date']).reset_index(drop=True)
death_df = death_df.sort_values(['Country/Region', 'Date']).reset_index(drop=True)
recovered_df = recovered_df.sort_values(['Country/Region', 'Date']).reset_index(drop=True)

confirmed_df['New_Cases'] = confirmed_df.groupby('Country/Region')['Cases'].transform('diff')
confirmed_df.head()
Out[13]:
Country/Region Date Cases New_Cases
0 Afghanistan 2020-01-22 0 NaN
1 Afghanistan 2020-01-23 0 0.0
2 Afghanistan 2020-01-24 0 0.0
3 Afghanistan 2020-01-25 0 0.0
4 Afghanistan 2020-01-26 0 0.0
In [14]:
death_df['New_Cases'] = death_df.groupby('Country/Region')['Cases'].transform('diff')
death_df.head()
Out[14]:
Country/Region Date Cases New_Cases
0 Afghanistan 2020-01-22 0 NaN
1 Afghanistan 2020-01-23 0 0.0
2 Afghanistan 2020-01-24 0 0.0
3 Afghanistan 2020-01-25 0 0.0
4 Afghanistan 2020-01-26 0 0.0
In [15]:
recovered_df['New_Cases'] = recovered_df.groupby('Country/Region')['Cases'].transform('diff')
recovered_df.head()
Out[15]:
Country/Region Date Cases New_Cases
0 Afghanistan 2020-01-22 0 NaN
1 Afghanistan 2020-01-23 0 0.0
2 Afghanistan 2020-01-24 0 0.0
3 Afghanistan 2020-01-25 0 0.0
4 Afghanistan 2020-01-26 0 0.0

Now, in each dataframe, the New_Cases column has a null value on the first date of every country, since diff has no previous row to subtract there. To correct this, we replace the null values with the corresponding value in the Cases column.

In [16]:
confirmed_df['New_Cases'] = np.where(confirmed_df['New_Cases'].isna(), confirmed_df['Cases'], confirmed_df['New_Cases'])
confirmed_df.head()
Out[16]:
Country/Region Date Cases New_Cases
0 Afghanistan 2020-01-22 0 0.0
1 Afghanistan 2020-01-23 0 0.0
2 Afghanistan 2020-01-24 0 0.0
3 Afghanistan 2020-01-25 0 0.0
4 Afghanistan 2020-01-26 0 0.0
In [17]:
death_df['New_Cases'] = np.where(death_df['New_Cases'].isna(), death_df['Cases'], death_df['New_Cases'])
death_df.head()
Out[17]:
Country/Region Date Cases New_Cases
0 Afghanistan 2020-01-22 0 0.0
1 Afghanistan 2020-01-23 0 0.0
2 Afghanistan 2020-01-24 0 0.0
3 Afghanistan 2020-01-25 0 0.0
4 Afghanistan 2020-01-26 0 0.0
In [18]:
recovered_df['New_Cases'] = np.where(recovered_df['New_Cases'].isna(), recovered_df['Cases'], recovered_df['New_Cases'])
recovered_df.head()
Out[18]:
Country/Region Date Cases New_Cases
0 Afghanistan 2020-01-22 0 0.0
1 Afghanistan 2020-01-23 0 0.0
2 Afghanistan 2020-01-24 0 0.0
3 Afghanistan 2020-01-25 0 0.0
4 Afghanistan 2020-01-26 0 0.0
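The np.where replacement above is equivalent to pandas' fillna; a small sketch with made-up cumulative counts:

```python
import pandas as pd

df = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A'],
    'Cases': [5, 7, 10],   # cumulative counts
})

# diff leaves the first row of each group as NaN...
df['New_Cases'] = df.groupby('Country/Region')['Cases'].diff()
# ...so fall back to the cumulative value itself on that first day.
df['New_Cases'] = df['New_Cases'].fillna(df['Cases'])
print(df['New_Cases'].tolist())
```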

To finish our data cleaning process, we merge the three dataframes, rename the Country/Region column to Country, look up the ISO country codes, drop the rows that do not correspond to a country, and save the data in this tidy format to a csv file.
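The merges below rely on the suffixes argument to disambiguate the overlapping column names; a toy sketch:

```python
import pandas as pd

a = pd.DataFrame({'Country': ['X'], 'Cases': [10], 'New_Cases': [2]})
b = pd.DataFrame({'Country': ['X'], 'Cases': [1], 'New_Cases': [0]})

# Columns present in both frames (other than the join keys) get the suffixes.
m = a.merge(b, on='Country', suffixes=['_Confirmed', '_Death'])
print(list(m.columns))
```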

In [19]:
covid_df = confirmed_df.merge(death_df, on=['Country/Region', 'Date'], suffixes=['_Confirmed', '_Death'])
covid_df = covid_df.merge(recovered_df, on=['Country/Region', 'Date'])
covid_df = covid_df.rename(columns={'Country/Region':'Country', 'Cases':'Cases_Recovered', 'New_Cases':'New_Cases_Recovered'})
covid_df.tail()
Out[19]:
Country Date Cases_Confirmed New_Cases_Confirmed Cases_Death New_Cases_Death Cases_Recovered New_Cases_Recovered
229738 Zimbabwe 2023-03-05 264127 0.0 5668 0.0 0 0.0
229739 Zimbabwe 2023-03-06 264127 0.0 5668 0.0 0 0.0
229740 Zimbabwe 2023-03-07 264127 0.0 5668 0.0 0 0.0
229741 Zimbabwe 2023-03-08 264276 149.0 5671 3.0 0 0.0
229742 Zimbabwe 2023-03-09 264276 0.0 5671 0.0 0 0.0
In [20]:
code_map = {'Brunei':'BRN',
           'Burma':'MMR',
           'Congo (Brazzaville)':'COG',
           'Congo (Kinshasa)':'COD',
           "Cote d'Ivoire":'CIV',
           'Holy See':'VAT',
           'Iran':'IRN',
           'Korea, North':'PRK',
           'Korea, South':'KOR',
           'Kosovo':'XXK',
           'Laos':'LAO',
           'Micronesia':'FSM',
           'Russia': 'RUS',
           'Syria':'SYR',
           'Taiwan*':'TWN',
           'West Bank and Gaza':'PSE'}
def get_country_code(name):
    try:
        return pycountry.countries.lookup(name).alpha_3
    except LookupError:
        # Fall back to the manual map; returns None for non-country entries.
        return code_map.get(name)
covid_df['Code'] = covid_df['Country'].apply(get_country_code)
covid_df.tail()
Out[20]:
Country Date Cases_Confirmed New_Cases_Confirmed Cases_Death New_Cases_Death Cases_Recovered New_Cases_Recovered Code
229738 Zimbabwe 2023-03-05 264127 0.0 5668 0.0 0 0.0 ZWE
229739 Zimbabwe 2023-03-06 264127 0.0 5668 0.0 0 0.0 ZWE
229740 Zimbabwe 2023-03-07 264127 0.0 5668 0.0 0 0.0 ZWE
229741 Zimbabwe 2023-03-08 264276 149.0 5671 3.0 0 0.0 ZWE
229742 Zimbabwe 2023-03-09 264276 0.0 5671 0.0 0 0.0 ZWE
In [21]:
covid_df = covid_df[~covid_df['Code'].isna()].reset_index(drop=True)
covid_df.to_csv('data/covid_tidy_data.csv', index=False, float_format='%.0f')
covid_df.tail()
Out[21]:
Country Date Cases_Confirmed New_Cases_Confirmed Cases_Death New_Cases_Death Cases_Recovered New_Cases_Recovered Code
225166 Zimbabwe 2023-03-05 264127 0.0 5668 0.0 0 0.0 ZWE
225167 Zimbabwe 2023-03-06 264127 0.0 5668 0.0 0 0.0 ZWE
225168 Zimbabwe 2023-03-07 264127 0.0 5668 0.0 0 0.0 ZWE
225169 Zimbabwe 2023-03-08 264276 149.0 5671 3.0 0 0.0 ZWE
225170 Zimbabwe 2023-03-09 264276 0.0 5671 0.0 0 0.0 ZWE

We ended up with 225171 rows.
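One caveat worth keeping in mind for the analysis: Johns Hopkins occasionally issued downward corrections, so some daily New_Cases values can be negative, which is why the totals in the next section filter on values greater than zero. A sketch with a hypothetical cumulative series:

```python
import pandas as pd

# Hypothetical cumulative counts; day 4 is a downward revision.
cumulative = pd.Series([0, 10, 25, 20, 30])
daily = cumulative.diff().fillna(cumulative)
print(daily.tolist())
print(int((daily < 0).sum()), "negative correction(s)")
```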

Exploratory data analysis¶

In this part of the project, we are going to gather some general information and plot several charts which will later be used in the final report. Let us start with the total number of confirmed, death, and recovered cases around the world.

In [22]:
covid_df = pd.read_csv('data/covid_tidy_data.csv')
confirmed_total = covid_df.query('New_Cases_Confirmed > 0')['New_Cases_Confirmed'].sum()
death_total = covid_df.query('New_Cases_Death > 0')['New_Cases_Death'].sum()
recovered_total = covid_df.query('New_Cases_Recovered > 0')['New_Cases_Recovered'].sum()

WIDTH = 210
HEIGHT = 297

layout = go.Layout(
    autosize=False,
    width=(WIDTH-10)*7,
    height=100,
)

fig = go.Figure(layout=layout)

fig.add_trace(go.Indicator(
    mode = "number",
    value = confirmed_total/10**6,
    number = {"valueformat": ".2f", "suffix":'M', 'font':{'size':56}, 'font_color':'black'},
    title = {"text": "Global confirmed cases", 'font_color':'black'},
    domain = {'x': [0, 0.3], 'y': [0, 1]}))

fig.add_trace(go.Indicator(
    mode = "number",
    value = death_total/10**6,
    number = {"valueformat": ".2f", "suffix":'M', 'font':{'size':56}, 'font_color':'black'},
    title = {"text": "Global death cases", 'font_color':'black'},
    domain = {'x': [0.33, 0.66], 'y': [0, 1]}))

fig.add_trace(go.Indicator(
    mode = "number",
    value = recovered_total/10**6,
    number = {"valueformat": ".2f", "suffix":'M', 'font':{'size':56}, 'font_color':'black'},
    title = {"text": "Global recovered cases", 'font_color':'black'},
    domain = {'x': [0.7, 1], 'y': [0, 1]}))

fig.write_image("images/kpis.png")
fig.show()

Now let us create a world map and color the countries according to their number of cases.

In [23]:
country_covid_df = covid_df.groupby(["Country", "Code"])['New_Cases_Confirmed'].sum().reset_index()
fig = px.choropleth(country_covid_df,                           
                     locations="Code",       
                     color="New_Cases_Confirmed",                   
                     hover_name="Country",             
                     #animation_frame="Date", 
                     title="Global Cases",
                     projection="natural earth",      
                     color_continuous_scale = 'Peach',  
                     range_color=[0, 100000000],    
                     labels={'New_Cases_Confirmed':'Number of cases'}
                     )  
fig.write_image("images/global_cases_map.png")
fig.show()   

Let us see the daily evolution of confirmed, death, and recovered cases.

In [24]:
fig, ax = plt.subplots(figsize=(8,2), dpi=150)
sns.lineplot(data=covid_df.groupby('Date')['New_Cases_Confirmed'].sum().reset_index(),
             x='Date', y='New_Cases_Confirmed', ax=ax)

ax.set_title("Confirmed Cases by Day")
ax.set_xticks(["2020-01-30", "2021-01-30", "2022-01-30","2023-01-30"])
ax.set_ylabel("Cases")
plt.savefig("images/confirmed_lineplot.png")
plt.show()
In [25]:
fig, ax = plt.subplots(figsize=(8,2), dpi=150)
sns.lineplot(data=covid_df.query("New_Cases_Death>0").groupby('Date')['New_Cases_Death'].sum().reset_index(),
             x='Date', y='New_Cases_Death', ax=ax, color="#AA5555")

ax.set_title("Death Cases by Day")
ax.set_xticks(["2020-01-30", "2021-01-30", "2022-01-30","2023-01-30"])
ax.set_ylabel("Cases")
plt.savefig("images/death_lineplot.png")
plt.show()
In [26]:
fig, ax = plt.subplots(figsize=(8,2), dpi=150)
sns.lineplot(data=covid_df.query("New_Cases_Recovered>0").groupby('Date')['New_Cases_Recovered'].sum().reset_index(),
             x='Date', y='New_Cases_Recovered', ax=ax, color="#5555AA")

ax.set_title("Recovered Cases by Day")
ax.set_xticks(["2020-01-30", "2021-01-30", "2022-01-30"])
ax.set_ylabel("Cases")
plt.savefig("images/recovered_lineplot.png")
plt.show()

To finish this part of the project, let us build tables with the top five countries by confirmed cases, death cases, and mortality (deaths divided by confirmed cases).
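The mortality table further down filters on Mortality < 1 to guard against reporting quirks where recorded deaths exceed recorded confirmed cases; a sketch with hypothetical numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['A', 'B'],
    'Cases_Confirmed': [1000, 1],
    'Cases_Death': [20, 6],   # country B is a reporting quirk
})
df['Mortality'] = (df['Cases_Death'] / df['Cases_Confirmed']).round(3)

# Keep only plausible mortality ratios.
plausible = df.query('Mortality < 1')
print(plausible)
```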

In [27]:
country_covid_df = covid_df.groupby('Country')[['Cases_Confirmed', 'Cases_Death']].max().reset_index()
country_covid_df['Mortality'] = round(country_covid_df['Cases_Death']/country_covid_df['Cases_Confirmed'], 3)
fig, ax = plt.subplots(figsize=(4,2), dpi=150)


fig.patch.set_visible(False)
ax.axis('off')
ax.axis('tight')

ax.set_title("Top 5 Countries - Confirmed Cases")
ax.table(cellText=country_covid_df.sort_values('Cases_Confirmed', ascending=False).head().values, colLabels=country_covid_df.columns, loc='upper center')


fig.tight_layout()
plt.savefig("images/top5_confirmed.png")
plt.show()
In [28]:
fig, ax = plt.subplots(figsize=(4,2), dpi=150)


fig.patch.set_visible(False)
ax.axis('off')
ax.axis('tight')

ax.set_title("Top 5 Countries - Death Cases")
ax.table(cellText=country_covid_df.sort_values('Cases_Death', ascending=False).head().values, colLabels=country_covid_df.columns, loc='upper center')


fig.tight_layout()
plt.savefig("images/top5_death.png")
plt.show()
In [29]:
fig, ax = plt.subplots(figsize=(4,2), dpi=150)


fig.patch.set_visible(False)
ax.axis('off')
ax.axis('tight')

ax.set_title("Top 5 Countries - Mortality")
ax.table(cellText=country_covid_df.query("Mortality < 1").sort_values('Mortality', ascending=False).head().values, colLabels=country_covid_df.columns, loc='upper center')


fig.tight_layout()
plt.savefig("images/top5_mortality.png")
plt.show()

Generating the report¶

In this part of the project we are going to use the FPDF package to generate a pdf report that can be scheduled and sent to key users. Let us start by creating the file and adding the report title. We are going to use an A4 sheet format (210 x 297 mm, matching the WIDTH and HEIGHT constants defined earlier).

In [30]:
pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", 'B', 28)
pdf.cell(0, 10, "Covid Analytics Report", align='C', ln=2)
pdf.set_font("Arial", '', 14)
pdf.cell(0, 10, "Mateus Melo", align='C')

Now let us fill the first page of the report.

In [31]:
pdf.image('images/kpis.png', x = 5, y = 40, w = WIDTH-10, h = 0, type = '', link = '')
pdf.image('images/global_cases_map.png', x = 5, y = 60, w = WIDTH-10, h = 0, type = '', link = '')
pdf.image('images/confirmed_lineplot.png', x = 5, y = 180, w = WIDTH-10, h = 0, type = '', link = '')
pdf.image('images/death_lineplot.png', x = 5, y = 240, w = WIDTH-10, h = 0, type = '', link = '')

Now let us fill the second page of the report and finish the report.

In [32]:
pdf.add_page()
pdf.image('images/recovered_lineplot.png', x = 5, y = 20, w = WIDTH-10, h = 0, type = '', link = '')
pdf.image('images/top5_confirmed.png', x = 30, y = 80, w = WIDTH-60, h = 0, type = '', link = '')
pdf.image('images/top5_death.png', x = 30, y = 140, w = WIDTH-60, h = 0, type = '', link = '')
pdf.image('images/top5_mortality.png', x = 30, y = 200, w = WIDTH-60, h = 0, type = '', link = '')
pdf.output("report.pdf", 'F')
Out[32]:
''

Conclusion¶

In this project we have generated an analytics report from data on covid-19 cases around the world. The method presented here may not be as polished or convenient as some of the most popular data visualization tools, but it pays off in cost and flexibility.